System for sound file recording, analysis, and archiving via the internet for language training and other applications

ABSTRACT

The invention is a system for sound file recording, comparison, and archiving for network-based language and communications training, or other applications. The invention allows capture of multimedia data from a user, and allows the user to play back his or her self-created sound inputs and to view various comparisons of his or her sound inputs with model sounds. The invention displays a waveform or spectrogram of a model sound superimposed over a waveform (or spectrogram) of the user's sound input. It can display a failure/success indication for the user's sound input relative to a predetermined standard. Further, the invention allows a user to archive sound files for subsequent review and analysis.

BACKGROUND

[0001] 1. Field of the Invention

[0002] The present invention relates to systems and methods for recording, analyzing, and archiving sound files, principally for distance learning applications (such as language learning, communication skills training, and performance arts training), speech-language assessment and/or therapy, or any other application over the World Wide Web, using Internet-connected computers.

[0003] 2. Background of the Invention

[0004] In today's global environment, good communication skills are increasingly important. These skills are invaluable in business settings, both domestically and abroad. They are also essential for success in most careers, educational pursuits, and personal relationships. The ability to speak one's native language clearly and precisely and to accomplish specific communicative tasks (such as persuasion or conflict resolution in a context-specific manner) is a key to success. The ability to speak a foreign language with clarity and fluency is also necessary when working, living, or traveling in another country.

[0005] Teachers, trainers, and other instructional professionals usually conduct language learning, accent coaching, foreign/second language learning, and native language communication skills training courses with groups of students who gather in classrooms or language laboratories at schools, colleges, and other educational institutions. They also teach onsite in corporate settings. Current methodologies may include one or more of the following support materials and technologies: blackboards, whiteboards, textbooks, graphics, tape players, VCRs, and audio- and videotapes.

[0006] Classroom learning is not practical for many people due to time and cost constraints. In addition, taking time away from work or other responsibilities to take language learning or communications skills development courses at an institution is difficult for many people. Therefore, many people purchase self-study programs that they can follow in their free time. These programs include audio- and videocassette and CD-ROM formats.

[0007] The benefits of these self-study courses are that they are low-cost, people can do them at any time, and they can do them alone in the privacy of their own home. This last benefit is especially important in that many people feel less inhibited about acquiring and practicing communication skills when they are alone. In fact, the fear of making mistakes or seeming foolish or childlike in front of others inhibits many people from succeeding in foreign language and communication courses.

[0008] In addition to the demand for self-study programs, class size at many institutions limits the opportunities for students of foreign or second languages and communication skills to get adequate individual attention. Therefore, there is an increased need for a customizable, home- or lab-based learning environment that is available twenty-four hours a day to supplement in-class learning. To serve this need, web-based and CD-ROM based learning programs have been developed.

[0009] Existing web-based language learning programs address the need for individualized, low cost, and convenient access. However, these programs do not offer on-line sound capture/archiving and spectrographic comparison capabilities which facilitate oral language and speech communication skill development.

[0010] Though some CD-ROM-based programs allow sound capture, playback, comparing, and archiving, and meet the need for relatively low cost and convenient access, they unfortunately preclude immediate student/teacher, student/trainer, and student/student interaction. Further, they cannot be modified easily by publishers or instructors in order to meet the changing needs of students.

[0011] Therefore, a readily accessible, instructor- or publisher-modifiable, World Wide Web-based system with a selection of tools for teaching and learning languages and communication skills can provide a better way for many students throughout the world to acquire these skills.

[0012] Drama coaches and music teachers also usually instruct students in a classroom, studio, or theater setting, occasionally using audio- or videotapes for modeling and feedback in practice sessions. Practice is a large, necessary part of learning dramatic roles and music and is usually done in isolation. Students must do it away from the class setting and often have difficulty finding the motivation to practice on their own. An easily accessible tool for practice can assist a student in mastering his or her art more rapidly.

[0013] Also, because instrumental music and voice instruction is most frequently conducted one-on-one, it can prove expensive and inconvenient for many students. Pre-programmed, computerized keyboard learning programs and CD-ROM-based guitar learning programs offer a convenient and inexpensive alternative and are popular choices for some. However, as with CD-ROM-based language programs, existing CD-ROM-based music training programs preclude immediate student/teacher, student/trainer, and student/student interaction. Further, they cannot be modified easily by publishers or instructors in order to meet the changing needs of students.

[0014] In another application involving speech-language disorder evaluation, analysis, and therapy, children and adults with speech-language disorders are diagnosed and treated in clinic, lab, and classroom settings. Speech-language disorders may include aphasia, neurogenic communication disorders, autism spectrum disorders, and hyperfunctional voice disorders. Speech-language pathologists use specialized audio equipment and computer programs to analyze speech disorders and to provide clients with therapeutic verbal activities. Sound spectrographs are used for analysis and feedback. Using the system functions, clients can develop needed auditory discrimination, speech pattern recognition, and relaxation techniques.

[0015] Because it is difficult for many stroke- or other neurologically disabled patients and children with speech-language disorders to be transported to access the professional help they need, the present invention allows easy and frequent access to treatment. Speech-language pathologists and speech therapists can obtain on a regular basis over time, and at their clients' convenience, verbal sound samples from which diagnosis and treatment can be determined. They can then distribute via the Internet therapeutic activities that employ self-feedback, or feedback that can be monitored off-site by a caregiver. This allows clients to obtain the therapy they need in the convenience of their home, nursing care, or assisted-living facility setting, and at lower cost. Though many speech-language pathologists and therapists will no doubt need to continue seeing patients in the traditional professional setting as well, they can readily supplement in-office (or clinic or lab) diagnosis and therapy with at-home activities and treatment.

SUMMARY

[0016] The invention provides an apparatus and a method for training users over a network. The training method includes capturing multimedia data from a user; providing feedback to the user by allowing the user to play, compare, and capture multimedia data; and archiving the captured multimedia data over a network.

[0017] Implementations of the invention include one or more of the following. An applet-type program can be downloaded for capturing the multimedia data from a user and one or more multimedia source files. The captured multimedia data can be compared against one or more multimedia source files. Waveforms associated with the captured multimedia data can be shown to the user for review. Spectrograms associated with the captured multimedia data can also be displayed to the user for review. The spectrogram associated with the captured multimedia data can be shown superimposed over a spectrogram associated with the one or more multimedia source files for comparison. The multimedia data can be speech, audio, or video data. The applet-type program can be a Java Sound applet. The captured multimedia data can be stored in a memory structure. The captured multimedia data can also be uploaded to a remote server for archival.

[0018] Advantages of the invention include one or more of the following. The present invention provides a suite of online learning tools which can be used individually or in combination to facilitate the acquisition and analysis of communication skills via computer network technology. It can be used for any of the following didactic and/or diagnostic purposes: training in areas such as the spoken aspects of a second or foreign language; training in targeted business communication skills such as pronunciation, voice tone and pitch, speaking pace, formality level, vocabulary development, and other communicative strategies such as persuasion; voice and instrumental instruction and drama coaching; speech-language pathology diagnosis and therapy; or any other sound-augmented training or instruction. It can also be used for other Internet-based sound capture and communications purposes.

[0019] The invention addresses at least five problems faced by language students and communication skills trainees in classroom or language laboratory settings:

[0020] the expense

[0021] the inconvenience

[0022] inhibition on the part of the student

[0023] the lack of individualization

[0024] publishers' inability to modify fixed media (such as CD-ROMs) for timely response to learners' needs

[0025] In language learning and communication skills development, the invention allows students to acquire, practice, and perfect skills at any time, in the privacy of their own home, and at their own pace. It does away with the inefficiencies associated with traveling to a classroom and with conforming to specific class schedules. Further, it eliminates the discomfort and tension that some communication skills students experience, which greatly inhibits acquisition. It allows material pronounced from audio or video clips to be customized to the student's own pace and requirements. The student can take advantage of the interactivity of Web-based learning in combination with traditional educational tools such as textbooks. For instance, the student can study from a textbook while viewing supplemental text and listening to sample material pronounced from any one of a choice of downloaded audio or video clips, which can be customized by the instructor or publisher to meet the students'/users' needs.

[0026] In addition, the student has complete control over the instructional medium as a function of his or her specific choices via interactive commands. Moreover, the student can learn from multiple audio streams or files originating from one or more Internet sites. Any one of a choice of downloaded audio streams or files may be selected using interactive commands issued by the student.

[0027] Further, interactive instructors and publishers can access a server and upload audio as well as other multimedia files, such as video clips, along with suggested lessons, exercises, and activities. The instructor or publisher can sequence the audio clips using suitable authoring tools in combination with the system functions to create an interactive communication skills learning program tailored to his or her students' needs. In this manner, the process of acquiring foreign language and communication skills can be interactive and more individualized and thus, more enjoyable than other traditional ways of learning such material.

[0028] Further, the invention provides graphical displays that enhance acquisition of material by providing an additional channel of informational input. Added sensory stimulation provided by the visual representation of their oral performance can facilitate learning for students whose learning styles rely on visual more than aural modes.

[0029] Finally, book and other content publishers also benefit, as updates and revisions may be published on the web to reduce the need to print new editions, and these may be made interactive using the system. Moreover, web-based communication skills training can incorporate written materials such as textbooks and extend these materials with multimedia supplements to avoid obsolescence due to the ubiquity of the web as a publishing medium.

[0030] For performance arts learning, the invention makes instruction more cost effective and allows students an opportunity to learn and practice any instrument, musical piece, or role at their convenience in an interactive and possibly more motivating way. It also provides music teachers or drama coaches with the opportunity to supplement their teaching and monitor students' practice via the Internet.

[0031] In addition, the invention makes it much easier for many speech-language disorder clients, such as those dealing with stroke or other neurological problems for whom special transportation may require added expense and hardship, to gain access to the diagnosis and treatment they need.

[0032] Additionally, content providers may splice audio-visual advertisements into their content as it is delivered. By virtue of the demographic information that may be available to the content providers via the system, it may be possible to target specific student/users with specific commercials. This targeting, which is an extension of the controlled access to content described later in the document, may allow content to be delivered on a geographic basis and blackouts to be established based on business requirements.

BRIEF DESCRIPTION OF THE FIGURES

[0033] FIG. 1 is a schematic illustration of a networked system for teaching and assessing language, communication, and performing arts skills, and for analyzing, diagnosing, and treating speech disorders.

[0034] FIG. 2 is a flowchart illustrating all the possible combinations of processes for training a student/user.

[0035] FIG. 3A is a flowchart illustrating a process for recording the student/user's sound input.

[0036] FIG. 3B is a flowchart illustrating a process for playing the student/user's sound input.

[0037] FIG. 4 is a flowchart illustrating a process for checking the student/user's sound input.

[0038] FIG. 5 is a flowchart illustrating a process for visually comparing the student/user's sound input against a model sound.

[0039] FIG. 6 is a flowchart illustrating a process for archiving the student/user's sound input.

[0040] FIG. 7 is a flowchart illustrating a process for handling student/user requests at a server.

[0041] FIG. 8 is a schematic diagram of an exemplary student/user workstation.

DESCRIPTION

[0042] Exemplary environments in which the techniques of the invention can be usefully employed are illustrated in FIG. 1. FIG. 1 shows a system 100 which includes one or more student/user workstations 112 and 114 connected to a network 110 such as the Internet. Also connected to the network 110 are one or more servers 116 and 118 which provide materials such as files relating to training, exercises, or activities suitable for downloading to the student/user workstations 112 and 114.

[0043] In another embodiment, the student/user workstations 112 and 114 can be attached to a network such as a telephone network, which in turn is attached to an Internet Service Provider (ISP) proxy server. The ISP proxy server is connected to another network which provides connections to the desired servers 116 and 118.

[0044] The student/user workstation 112 or 114 is a multimedia computer and contains a sound board which captures speech directed at a microphone. Speech is a product of the interaction of respiratory, laryngeal, and vocal tract structures. The larynx functions as a valve that connects the respiratory system to the airway passages of the throat, mouth, and nose. The vocal tract system consists of the various passages from the glottis to the lips. It involves the pharynx and the oral and nasal cavities, including the tongue, teeth, velum, and lips. The production of speech sounds through these organs is known as articulation.

[0045] When the student/user speaks into the microphone, changes in pressure produced by the student/user's larynx and sensed by the microphone are converted to proportional variations in electrical voltage. In addition to the sounds produced by the larynx, sounds can be produced in other parts of the vocal tract. These sounds are usually made by forcing air to flow through narrow openings. For example, when an “s” sound is made, air is forced between the tongue and the roof of the mouth (the palate). The turbulence created by this air flow produces the desired sound. Vowels are typically low-frequency sounds, while consonants contain higher frequencies. The speech-related information is captured by the microphone as analog voltages.

[0046] The sound board converts the analog voltage variations into digital sound waveforms using analog-to-digital conversion (ADC). Speech can be sampled at different rates, currently between 8 kHz and 44.1 kHz. A higher sampling rate yields better sound quality, but requires the transmission of greater amounts of data and thus larger sound files. Also, each sampled pressure value is rounded, or quantized, to the nearest value. The quantization level can be 8-32 bits, with 16 bits being typical.
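
By way of illustration, the sampling parameters described above can be expressed with the Java Sound API, which the specification identifies as one suitable implementation technology. The following minimal sketch assumes a 44.1 kHz, 16-bit, mono, signed PCM format; the specific values and the class name CaptureFormat are illustrative only.

```java
import javax.sound.sampled.AudioFormat;

public class CaptureFormat {
    // Illustrative format: 44.1 kHz sample rate, 16-bit quantization, mono, signed, little-endian.
    public static AudioFormat pcmFormat() {
        float sampleRate = 44100.0f;   // samples per second
        int sampleSizeInBits = 16;     // quantization level
        int channels = 1;              // single microphone channel
        boolean signed = true;
        boolean bigEndian = false;
        return new AudioFormat(sampleRate, sampleSizeInBits, channels, signed, bigEndian);
    }
}
```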

[0047] The student/user workstation 112 or 114 also includes a network browser such as Web browser software for viewing a web page. In addition to text and images positioned on the web page, the web page is enhanced by various executable programs (currently and commonly referred to as “applets” or “applet-type programs”) attached to it. These programs may be downloaded and executed by the web browser along with the text, graphics, sound, and video associated with the web page.

[0048] These executable programs are constructed from a particular type of programming language, one of which is the Java language, available from Sun Microsystems, Inc. of Mountain View, Calif. Java is executed by the web browser to enhance the web page with additional functionality and represents the current state of the art of this attached executable file technology as it pertains to the invention, though the functions and processes described may in the future be implemented in other programming languages.

[0049] In particular, one implementation of Java called Java Sound provides uniform access to underlying platform sound capabilities. Java Sound is part of the Java Media family, which addresses the increasing demand for multimedia by providing a unified, non-proprietary, platform-neutral solution for incorporating time-based media, 2D fonts, graphics and images, speech input and output, 3D models, and telephony in Java programs. By providing standard players and integrating these supporting technologies, the Java Media Application Program Interfaces (APIs) enable developers to produce and distribute media-rich content such as educational content.

[0050] Java Sound presently enables Java programs to read and write sampled and synthesized audio data and provides high-level services such as compression, decompression, synchronization, streaming, container read/write, and network transport through the Java Media Framework (JMF). JMF provides a simple, unified way for Java programs to synchronize and display time-based data such as audio and video.

[0051] Java Sound provides a very high-quality 64-channel audio rendering and MIDI sound synthesis engine that enables consistent, reliable, high-quality audio on all Java platforms; minimizes the impact of audio-rich web pages on computing resources; and reduces the need for high-cost sound cards by providing a software-only solution that requires only a digital-to-analog converter (DAC). Java Sound supports a wide range of audio formats so that audio clips can be recorded and played from both applet-type programs and applications. The clips can be any of the following audio file formats: AIFF, AU, WAV, MIDI (Type 0 and Type 1 files), and RMF, among others.

[0052] Referring now to FIG. 2, process 200, an overview of all possible training processes of the invention, is shown. First, a student/user, such as a student, logs on to one of the servers 116 and 118 (FIG. 1) operated by one or more content providers (step 202). The log-on process can be controlled by a subscription control model where the student/user pays a one-time course fee or a periodic fee (for example, monthly) for access to the service. Additionally, the system supports a pay-per-view control model where the student/user pays each time he or she accesses a stream of content. In the subscription model, the system ensures that only valid customers gain access. Once the subscription has been established, access to subscription services is transparent to the student/user, unless the subscription expires. In the pay-per-view model, the student/user gains access to the content through a secure web page. The student/user may enter credit card information or provide payment in some other way. Only when the payment has been validated is the student/user's player allowed to access the content stream.

[0053] After gaining entry to the content provider's server, the student/user accesses one or more multimedia content files, including lessons, exercises, or planned activities provided by the content provider (step 204). Next, the invention applet-type program herein described is either downloaded from the same content provider (for example, an educational publisher), or from some other source such as a separate educational portal site or server (step 205). The applet-type program is a small downloadable software program, such as software written in Java or another language.

[0054] When executed, the applet-type program displays a plurality of student/user interface icons to the student/user, such as one or more buttons in a floating panel, frame, or diagram. In the embodiment of FIG. 2, a plurality of buttons are displayed by the applet-type program on the computer's screen to facilitate the training process. Depending on the number of functions to be provided by the content provider or server site operator, one or more of the following buttons can be shown: a record button; a playback button; a check button; a compare button; and an archive button. Additionally, an exit button is provided to allow the student/user to close the panel, frame, or diagram and to end the application. Each button may be activated singly or in combination with others. The following steps are detailed with respect to these buttons.
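
A minimal sketch of such a button panel, written here with standard Swing components rather than the applet framework named in the specification, is shown below. The class name TrainingPanel, the button labels, and the placeholder action listeners are illustrative assumptions; a real implementation would dispatch to the record, playback, check, compare, archive, and exit modules described in the following steps.

```java
import java.awt.FlowLayout;
import javax.swing.JButton;
import javax.swing.JFrame;
import javax.swing.JPanel;

public class TrainingPanel {
    public static void main(String[] args) {
        JFrame frame = new JFrame("Training Controls");
        JPanel panel = new JPanel(new FlowLayout());
        // One button per component process described in FIG. 2.
        String[] labels = {"Record", "Play", "Check", "Compare", "Archive", "Exit"};
        for (String label : labels) {
            JButton button = new JButton(label);
            // A real implementation would invoke the corresponding module here.
            button.addActionListener(e -> System.out.println(label + " selected"));
            panel.add(button);
        }
        frame.add(panel);
        frame.pack();
        frame.setDefaultCloseOperation(JFrame.EXIT_ON_CLOSE);
        frame.setVisible(true);
    }
}
```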

[0055] In step 206, the process 200 checks whether the record button has been selected. If so, a record module is executed (step 207). Preferably, the recording operation should precede any other operation such as the playback, check, compare, or archive operation. Step 207 is shown in more detail in FIG. 3A. From step 206, if the record button has not been selected, the process 200 checks whether the play button has been selected (step 208). If so, data previously captured from the student/user is played by the computer (step 209). From step 207 or 209, the process 200 checks whether the student/user has completed the session (step 242). If the student/user wishes to continue the training process, he or she clicks the button corresponding to the desired component process. If the student/user does not wish to continue, he or she clicks the exit button.

[0056] The process 200 also determines whether the check button has been selected (step 212). If so, the process 200 executes a check module (step 220) before proceeding to step 242 to continue the training session. Step 220 is shown in more detail in FIG. 4.

[0057] The process 200 further checks whether the compare button has been selected (step 222). If so, a compare module is executed (step 230). Step 230 is shown in more detail in FIG. 5.

[0058] The process 200 also checks whether the archive button has been selected (step 232). If so, an archive module, shown in more detail in FIG. 6, is executed (step 240).

[0059] As shown in FIG. 2 and discussed above, each button associated with the record, playback, check, compare, and archive operations can be used singly or in combination with one or more other operations. These operations can be executed in any order, depending upon what the student/user wants to do.

[0060] Upon completing the execution of each of the modules, the process 200 loops back to step 242 to check whether the student/user desires to end the training process. If so, the process 200 ends. However, following the process from beginning to end and executing all the components will afford the student/user an optimal learning experience.

[0061] Referring now to FIGS. 3A and 3B, the record and playback component processes 207 and 209 of FIG. 2 are shown in more detail. In FIG. 3A, the record process 207 is detailed. First, sound input from the student/user is captured by the sound system (step 252). The captured sound input is stored in a data storage structure for subsequent playback or editing as necessary (step 254). The captured sound input may be compressed or uncompressed. The sound input can be stored in memory or in a data storage device such as flash memory or a disk drive.
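
One way the capture and storage of steps 252 and 254 might be sketched with the Java Sound API is shown below. The duration-based stopping condition and the class name Recorder are illustrative assumptions; the specification leaves the stopping mechanism to the embodiments described later.

```java
import java.io.ByteArrayOutputStream;
import javax.sound.sampled.AudioFormat;
import javax.sound.sampled.AudioSystem;
import javax.sound.sampled.TargetDataLine;

public class Recorder {
    // Capture roughly `seconds` of microphone audio into a byte array
    // (the in-memory data storage structure of step 254).
    public static byte[] record(AudioFormat format, int seconds) throws Exception {
        TargetDataLine line = AudioSystem.getTargetDataLine(format);
        line.open(format);
        line.start();
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        byte[] chunk = new byte[4096];
        long bytesWanted = (long) (format.getFrameRate() * format.getFrameSize() * seconds);
        long bytesRead = 0;
        while (bytesRead < bytesWanted) {
            int n = line.read(chunk, 0, chunk.length);
            if (n <= 0) break;
            buffer.write(chunk, 0, n);
            bytesRead += n;
        }
        line.stop();
        line.close();
        return buffer.toByteArray();
    }
}
```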

[0062] FIG. 3B shows the playback process 209 in detail. In the process 209, the sound input of the student/user is retrieved from the memory or data storage device (step 250). Next, the sound input can be streamed to the audio system and played for the student/user to provide a multimedia example to imitate and learn from (step 256), or to allow the user to listen to his or her own sound input for analysis.
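
A corresponding playback sketch for steps 250 and 256, again using the Java Sound API, might look like the following; the Clip-based approach and the class name Player are illustrative choices, not mandated by the specification.

```java
import java.io.ByteArrayInputStream;
import javax.sound.sampled.AudioFormat;
import javax.sound.sampled.AudioInputStream;
import javax.sound.sampled.AudioSystem;
import javax.sound.sampled.Clip;

public class Player {
    // Play previously captured PCM bytes back through the workstation's audio system.
    public static void play(byte[] pcm, AudioFormat format) throws Exception {
        long frames = pcm.length / format.getFrameSize();
        AudioInputStream stream =
                new AudioInputStream(new ByteArrayInputStream(pcm), format, frames);
        Clip clip = AudioSystem.getClip();
        clip.open(stream);
        clip.start();
        clip.drain();   // block until playback finishes
        clip.close();
    }
}
```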

[0063] Through the record/playback functions, the student/user can imitate a sound model and, upon reaching what he or she considers a satisfactory emulation as compared to the sample source sound, can capture his or her sound input to a data structure in memory for temporary storage.

[0064] In one embodiment, the student/user clicks on the record button to stop the recording of his or her sound input. In a second embodiment, a stop button is provided for the student/user to select when he or she is finished with the sound input. Alternatively, an automated end-point detection process may be used to stop the recording process. The end-point detection process identifies sections in an incoming audio signal that contain speech. In one embodiment, the detection process detects an end-point when the sound input is silent (no other noises). Typical algorithms look at the energy or amplitude of the incoming signal and at the rate of “zero-crossings”. A zero-crossing occurs when the audio signal changes from positive to negative or vice versa. When the energy and zero-crossings are at certain levels, the end-point detection process determines that the student/user has started talking and thus captures the sound input into the data storage structure for storage. Once captured, the student/user's sound may be generated by playing sound data stored in the data storage structure.
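
The energy and zero-crossing measurements described above could be sketched as follows; the frame-based structure, the threshold values, and the class name EndpointDetector are illustrative assumptions rather than values taken from the specification.

```java
public class EndpointDetector {
    // Frame-level energy and zero-crossing measurements over 16-bit PCM samples.
    // Threshold values are illustrative, not taken from the specification.
    static final double ENERGY_THRESHOLD = 500.0;
    static final int ZERO_CROSSING_THRESHOLD = 10;

    public static boolean isSpeech(short[] frame) {
        double energy = 0.0;
        int zeroCrossings = 0;
        for (int i = 0; i < frame.length; i++) {
            energy += Math.abs(frame[i]);
            if (i > 0 && (frame[i - 1] >= 0) != (frame[i] >= 0)) {
                zeroCrossings++;   // signal changed sign between adjacent samples
            }
        }
        energy /= frame.length;   // average magnitude of the frame
        // Speech is assumed present when both measures exceed their thresholds.
        return energy > ENERGY_THRESHOLD && zeroCrossings > ZERO_CROSSING_THRESHOLD;
    }
}
```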

[0065] The student/user can play his or her captured sound and, based on the reproduced sound, determine for himself or herself whether the produced sound is satisfactory (step 254). If so, the process 210 exits. If not, the student/user can decide whether the sample file is to be played again to refresh his or her memory of the model sound (step 256). If so, the process loops back to step 250 where the student/user can play the downloaded sample multimedia file. Alternatively, the process 210 loops back to step 252 to allow the student/user to mimic the word/phrase and capture the sound input.

[0066] The student/user can click the playback button to play his or her captured sound. At this point, the student/user has various options. One option is for the student/user to exit process 210. Another option is for the student/user to decide whether the sample file is to be played again to refresh his or her memory of the model sound (step 256). If so, the process loops back to step 250 where the student/user can play the downloaded sample multimedia file. Alternatively, the process 210 loops back to step 252 to allow the student/user to imitate the word/phrase and capture the sound input as many times as desired.

[0067] Turning now to FIG. 4, the process 220 of FIG. 2 is illustrated in more detail. The process 220 allows the computer to automatically check whether the student/user's sound input meets a predetermined sound input standard. The process 220 then analyzes the student/user's sound input (step 262).

[0068] From step 262, the process 220 determines whether the sound input is acceptable relative to a predetermined standard (step 264). In one embodiment, confidence-level assessment technology may be used to do this. If the sound input is acceptable, a green light is displayed (step 265). If not, the process 220 then displays a failure indication, such as a red light, and/or various remedial suggestions to the student/user (step 266). From step 264 or step 266, the process 220 further checks whether the student/user wishes to retry the training process. If so, the process 220 loops back from step 268 to step 260 to continue the checking process. Otherwise, the process 220 exits.

[0069] The process of FIG. 4 can, for example, allow the user to check whether consonants are properly formed and performed with sufficient intensity so that the diction is made clear. Moreover, vowel issues such as purity, shifts, diphthongs, transitions, initiations, staccato, and endings can be identified and resolved. In the same way, it can also compare the stress and intonation of a communication trainee or drama student's utterance with that of a model, or the timbre of the sound of the user's vibrating violin string with that of a model.

[0070] Turning now to FIG. 5, the compare module 230 of FIG. 2 is shown in more detail. First, the process 230 creates a waveform for the model sound file, then a waveform for the student/user's input sound file. Then it overlays the latter on the former so the student/user can compare them (step 272).

[0071] By overlaying the student/user's speech (or sound) waveform over the model sound input waveform, the compare module 230 provides the student/user with visual feedback regarding his or her sound input relative to a “norm” or a standard sound input. The student/user can interactively repeat his or her sound input until it satisfactorily approximates the model sound input.

[0072] Alternatively or additionally, the waveforms may be simplified as graphical representations so that the graphics need not follow exactly the student/user's sound waveform. Thus, speech information may be extrapolated and simplified representations can be delivered as feedback for the student/user.

[0073] Additionally, the process 230 can analyze model and student/user sound files and display spectrograms to allow the student/user to visually compare his or her sound input with the model sound input. In the spectrogram embodiment, time is plotted along the horizontal axis, and frequency is plotted along the vertical axis. The intensity of the sound at each particular time is represented by the amount of shading in the graph.

[0074] As discussed above, the student/user can manually review the displays of the sound input waveforms or spectrograms and repeat the sound input process if necessary. Alternatively, the process 230 can determine whether the sound input meets a predetermined standard (step 274). In one embodiment, this is done by computing waveform differences or spectrogram differences between the student/user's sound input and the model sound input, and if the differences exceed a predetermined tolerance range, the process brings up a graph that indicates the deviation to the student/user, who then can refine his or her sound input.
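
As an illustration of the difference test in step 274, the following sketch compares aligned student and model samples with a root-mean-square measure; the RMS metric, the normalization assumption, and the class name WaveformComparator are illustrative choices, since the specification does not fix a particular difference measure.

```java
public class WaveformComparator {
    // Root-mean-square difference between the student's samples and the model's samples,
    // both assumed to be time-aligned and normalized to [-1, 1]. The tolerance is supplied by the caller.
    public static boolean withinTolerance(double[] student, double[] model, double tolerance) {
        int n = Math.min(student.length, model.length);
        double sum = 0.0;
        for (int i = 0; i < n; i++) {
            double d = student[i] - model[i];
            sum += d * d;
        }
        double rms = Math.sqrt(sum / n);
        return rms <= tolerance;   // true: acceptable; false: highlight the deviation
    }
}
```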

[0075] Thus, if the deviation is significant, the process 230 highlights the differences (step 276) and, in effect, prompts the student/user to decide whether or not he or she wishes to retry the training process (step 278). From step 278, in the event that the student/user does not wish to retry the lesson or exercise, the process 230 exits.

[0076] The analysis in step 274 can also be done using a number of audio or speech processing functions which essentially analyze a complex signal such as the voice as being made up of the sum of sound waves of many different frequencies. The vibration of the vocal folds produces a series of harmonics. The lowest harmonic is called the fundamental frequency, and it is typically about 100 Hertz.

[0077] In one embodiment, the analysis uses spectrum analysis. In this case, speech sounds, especially vowels, have distinctive signatures (patterns of bands at certain frequencies). Vowels, for example, are identifiable by two or three bands of energy (called “formants”) at certain intervals, or in the case of diphthongs, movement of the bands over time. These signatures may be revealed using a sound spectrogram, which is a plot of the amount of energy at various frequencies over time.

[0078] The student/user's sound can be represented as a combination of sine waves of various frequencies. Fourier analysis is applied to the speech waveform to discover the presence of frequencies at any given moment in the speech signal. The Fourier transform can analyze a signal in the time domain for its frequency content. The transform works by first translating a function in the time domain into a function in the frequency domain. The signal can then be analyzed for its frequency content because the Fourier coefficients of the transformed function represent the contribution of each sine and cosine function at each frequency.

[0079] The result of Fourier analysis is a spectrum, or the intensities of the various sine waves that are the components of that sound. After computing the spectrum for one short section or window (typically 5 to 20 milliseconds) of speech, the spectrum for the adjoining window is then computed, until the end of the waveform is reached.
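
A direct (unoptimized) discrete Fourier transform of one analysis window, as described above, could be sketched as follows; a production system would normally apply a window function and use an FFT, and the class name Spectrum is an illustrative assumption. Repeating this computation for successive windows yields the spectrogram described in the next paragraph.

```java
public class Spectrum {
    // Magnitude spectrum of one short analysis window (e.g. 5-20 ms of samples),
    // computed with a direct DFT for clarity.
    public static double[] magnitudeSpectrum(double[] window) {
        int n = window.length;
        double[] magnitudes = new double[n / 2];
        for (int k = 0; k < n / 2; k++) {
            double re = 0.0, im = 0.0;
            for (int t = 0; t < n; t++) {
                double angle = 2.0 * Math.PI * k * t / n;
                re += window[t] * Math.cos(angle);
                im -= window[t] * Math.sin(angle);
            }
            magnitudes[k] = Math.sqrt(re * re + im * im);   // intensity of the k-th sine component
        }
        return magnitudes;
    }
}
```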

[0080] The spectra computed by the Fourier transform are displayed parallel to the vertical or y-axis. For a given spectrogram, the strength of a given frequency component at a given time in the speech signal is represented by the darkness or color of the corresponding point of the spectrogram. In one embodiment, the resulting spectrogram may be displayed in a gray-scale rendition, where the darkness of a given point is proportional to the energy at that time and frequency. However, color may be used as well. In this embodiment, the waveform and spectrogram for the same segment of speech may be shown one on top of the other so that it is easy to see the relation between patterns in the waveform and the corresponding patterns in the spectrogram.

[0081] In this manner, the process of FIG. 5 allows the student/user to check whether consonants are properly formed and performed with sufficient intensity so that the diction is made clear. Moreover, vowel issues such as purity, shifts, diphthongs, transitions, initiations, staccato, and endings can be identified and resolved. In the same way, it can also compare the stress and intonation of a communication trainee or drama student's utterance with that of a model, or the timbre of the sound of the user's vibrating violin string with that of a model.

[0082] Although spectrogram analysis is used, other analytical methods may be used as well. For instance, linear prediction analysis of speech may be used. The basis is a source-filter model where the filter is constrained to be an all-pole linear filter. This amounts to performing a linear prediction of the next sample as a weighted sum of past samples. In other embodiments, formant analysis may be used. Formants are perceptually defined, and the corresponding physical property is the frequencies of resonances of the vocal tract. Additionally, non-linear frequency scales that approximate the sensitivity of the human ear may be used, including Constant Q, where Q is the ratio of filter bandwidth over center frequency, which implies an exponential form; Equivalent Rectangular Bandwidth (ERB), where the bandwidths of the auditory filters are measured; Bark, which is derived from perception experiments; Basilar membrane, which is the distance measured along the basilar membrane; and Mel, an engineered solution to non-linear frequency scales.
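
For the Mel scale mentioned above, one commonly used hertz-to-mel mapping is shown below as an illustration; the specification names the scale but does not specify a formula, so the constants here are the conventional ones rather than values from the document.

```java
public class FrequencyScales {
    // A commonly used mapping from frequency in hertz to the mel scale.
    public static double hertzToMel(double hz) {
        return 2595.0 * Math.log10(1.0 + hz / 700.0);
    }
}
```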

[0083] Additionally, neural networks can be used to perform the analysis. A neural network consists of a collection of cells which are interconnected, where every connection has an associated positive or negative number, called a weight or component value. Each cell employs a common rule to compute a unique output, which is then passed along connections to other cells. The particular connections and component values determine the behavior of the network. Through a training process, the component values are set. During operation, data is “weighted” using the component values, and the outputs of the cells are accumulated from one layer of cells to the next until the outputs propagate to the output cell.
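
The weighted-sum propagation described above might be sketched as a single fully connected layer as follows; the sigmoid activation and the bias term are illustrative assumptions, as the specification only states that each cell applies a common rule to a weighted combination of its inputs.

```java
public class NeuralLayer {
    // Propagate an input vector through one fully connected layer:
    // each cell computes a weighted sum of its inputs and applies a common rule (here, a sigmoid).
    public static double[] forward(double[] input, double[][] weights, double[] bias) {
        double[] output = new double[weights.length];
        for (int cell = 0; cell < weights.length; cell++) {
            double sum = bias[cell];
            for (int i = 0; i < input.length; i++) {
                sum += weights[cell][i] * input[i];   // data is "weighted" by the component values
            }
            output[cell] = 1.0 / (1.0 + Math.exp(-sum));   // common output rule
        }
        return output;   // passed along connections to the next layer of cells
    }
}
```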

[0084] Referring now to FIG. 6, the archive module 240 of FIG. 2 is illustrated. The process 240 prompts the student/user as to whether or not the sound input is to be archived (step 292). If so, the student/user's sound input file can be compressed before it is uploaded to a remote server (step 294). The file transfer can use a number of suitable transfer protocols such as the Internet file transfer protocol (FTP), among others. From step 292 or 294, upon receiving an acknowledgment of a successful file transfer, the process of FIG. 6 exits.
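
A sketch of the compress-and-upload path of step 294 is shown below. The specification mentions FTP among other suitable protocols; for brevity this illustration compresses with GZIP and posts over HTTP to a hypothetical archive URL, so both the transport choice and the Archiver class name are assumptions.

```java
import java.io.ByteArrayOutputStream;
import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.URL;
import java.util.zip.GZIPOutputStream;

public class Archiver {
    // Compress the captured sound bytes and upload them to a (hypothetical) archive URL.
    public static int archive(byte[] sound, String archiveUrl) throws Exception {
        ByteArrayOutputStream compressed = new ByteArrayOutputStream();
        try (GZIPOutputStream gzip = new GZIPOutputStream(compressed)) {
            gzip.write(sound);
        }
        HttpURLConnection conn = (HttpURLConnection) new URL(archiveUrl).openConnection();
        conn.setDoOutput(true);
        conn.setRequestMethod("POST");
        conn.setRequestProperty("Content-Type", "application/octet-stream");
        try (OutputStream out = conn.getOutputStream()) {
            out.write(compressed.toByteArray());
        }
        return conn.getResponseCode();   // serves as the acknowledgment of a successful transfer
    }
}
```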

[0085] Turning now to FIG. 7, a server process 300 is shown. The server process 300 handles requests and commands from the student/user workstation 112 or 114. First, the process 300 checks whether the student/user is logged in (step 302). If so, the process 300 authenticates the student/user (step 304). From step 306, if the student/user is authorized, the process 300 accepts commands and requests from the student/user workstation 112 or 114. First, the process 300 checks whether the request is a download request (step 308). If so, the server process 300 sends the invention (in the form of a floating-type panel of buttons that delivers a variety of component functions) along with a requested source multimedia file to the student/user workstation 112 or 114 if the server is supplying both (step 310); if not, the multimedia source file may come from another server.

[0086] From step 308, in the event that the request is not a download request, the server process 300 checks whether the request is an upload request (step 312). If so, the process 300 receives the student/user multimedia data file and stores it on the network to be retrieved and reviewed later by the student/user, an instructor, or another person (step 314). Additionally, data associated with the student/user can be provided to an educational institution for subsequent review by an educator.

[0087] From step 312, in the event that the request is neither a download nor an upload request, the server process 300 checks whether the student/user connection has timed out (step 316). If not, the process 300 loops back to step 308 to continue processing student/user requests.

[0088] The techniques described here may be implemented in hardware or software, or a combination of the two. Preferably, the techniques are implemented in computer programs executing on programmable computers, each of which includes a processor, a storage medium readable by the processor (including volatile and nonvolatile memory and/or storage elements), and suitable input and output devices. These programmable computers include workstations, desktop computers, handheld computers, and computer appliances. Program code is applied to data entered using an input device to perform the functions described and to generate output information. The output information is applied to one or more output devices.

[0089] FIG. 8 illustrates one such computer system 600, including a CPU 610, a RAM 620, a ROM 622, and an I/O controller 630 coupled by a CPU bus 698. The I/O controller 630 is also coupled by an I/O bus 650 to input devices such as a keyboard 660 and a mouse 670, and output devices such as a monitor 680. A modem 682 is attached to the I/O bus 650 to allow the student/user to connect to the Internet and to communicate with other networks.

[0090] Additionally, a sound board 684, such as a SoundBlaster sound board, is connected to the I/O bus 650. A microphone 686 is connected to the input of the sound board 684 to capture the student/user's voice or instrument. The microphone is an acoustic-to-electronic transducer. Its internal diaphragm sympathetically moves from the compression and rarefaction of sound wave energy that reaches it. This movement of the diaphragm is converted to an electronic signal. A conventional microphone or a noise canceling microphone, which measures the pressure difference in a sound wave between two points in space, may be used. Noise canceling microphones are advantageous in noisy environments since they pick up desired sounds that are close to the student/user while rejecting unwanted noise that is farther away.

[0091] The I/O controller 630 also drives an I/O interface 690 which in turn controls a removable media drive 692. Typically, memory media such as a floppy disk, CD-ROM, or Digital Video Disk can contain the program information for controlling the computer to enable the computer to perform its functions in accordance with the invention.

[0092] Variations are within the scope of the following claims. For example, instead of using a mouse as the input device to the computer system 600, a pressure-sensitive pen or tablet may be used to generate the cursor position information. Moreover, each program is preferably implemented in a high level procedural or object-oriented programming language to communicate with a computer system. However, the programs can be implemented in assembly or machine language, if desired. In any case, the language may be a compiled or interpreted language.

[0093] Each such computer program is preferably stored on a storage medium or device (e.g., CD-ROM, hard disk, or magnetic diskette) that is readable by a general or special purpose programmable computer for configuring and operating the computer when the storage medium or device is read by the computer to perform the procedures described. The system also may be implemented as a computer-readable storage medium, configured with a computer program, where the storage medium so configured causes a computer to operate in a specific and predefined manner.

[0094] Additionally, although embodiments of the invention are described as sound-based training systems, video-based training can also be done. In these embodiments, the student/user workstation captures video information from the student or user and communicates this information across a network. Multipoint conferencing can also be provided over circuit-switched communication networks. Moreover, other multimedia systems which support long distance communication of coordinated voice, video, and data can also be used in conjunction with the invention.

[0095] While the invention has been shown and described with reference to an embodiment thereof, those skilled in the art will understand that the above and other changes in form and detail may be made without departing from the spirit and scope of the following claims.

What is claimed is:
 1. A method for training using a network, comprising: capturing multimedia data from a user; providing feedback to the user by allowing the user to play and capture multimedia data; and archiving the captured multimedia data over a network.
 2. The method of claim 1, further comprising downloading an applet-type program for capturing the multimedia data from a user and one or more multimedia source files.
 3. The method of claim 2, further comprising comparing the captured multimedia data against the one or more multimedia source files.
 4. The method of claim 3, further comprising displaying waveforms associated with the captured multimedia data to the user for review.
 5. The method of claim 3, further comprising displaying spectrograms associated with the captured multimedia data to the user for review.
 6. The method of claim 3, further comprising displaying a spectrogram associated with the captured multimedia data over a spectrogram associated with one or more multimedia source files.
 7. The method of claim 3, wherein the multimedia data is speech or audio data.
 8. The method of claim 3, wherein the multimedia data is video data.
 9. The method of claim 2, wherein the applet-type program is a Java Sound applet.
 10. The method of claim 1, further comprising storing the captured multimedia data in a data storage structure.
 11. The method of claim 1, further comprising uploading the captured multimedia data to a remote server.
 12. A network training system, comprising: means for capturing multimedia data from a user; means for providing feedback to the user by allowing the user to play and capture multimedia data; and means for archiving the captured multimedia data over a network.
 13. The system of claim 12, further comprising means for downloading an applet-type program for capturing the multimedia data from a user and one or more multimedia source files.
 14. The system of claim 13, further comprising means for comparing the captured multimedia data against the one or more multimedia source files.
 15. The system of claim 13, further comprising means for displaying waveforms associated with the captured multimedia data to the user for review.
 16. The system of claim 13, further comprising means for displaying spectrograms associated with the captured multimedia data to the user for review.
 17. An educational workstation, comprising: a processor; a display device coupled to the processor; a network interface device coupled to the processor to allow the processor to communicate with a server over a network; a sound system coupled to the processor; a data storage device coupled to the processor, the data storage device adapted to store instructions to: capture multimedia data from a user; provide feedback to the user by allowing the user to play and capture multimedia data; and archive the captured multimedia data on a server over the network.
 18. A remote training system, comprising: a server adapted to download instructional materials over a network and to archive captured multimedia data over the network; a workstation adapted to communicate with the server, the workstation including means for capturing multimedia data from a user; means for providing feedback to the user by allowing the user to play and capture multimedia data; and means for archiving the captured multimedia data over a network.
 19. The system of claim 18, wherein the server sends materials from a publisher.
 20. The system of claim 18, wherein the server sends materials from a content provider.